HW#4 --- last modified February 28 2019 22:15:12.

Solution set.

Due date: Nov 20

Files to be submitted:
  Hw4.zip

Purpose: To build a more awesome search engine and to learn about XML, XPath, Ajax, and REST.

Related Course Outcomes:

LO2 -- Write schemas, DTDs, and style sheets for XML documents.

LO5 -- Develop and deploy web applications that involve components, web services, and databases.

Specification:

You will continue to use svn to work in groups on Hw4, and to submit the homework you will continue to use my form to give me the URL of your svn repository. This URL will be used from the command line as follows:

svn co url

If this command does not check out your homework, it will not be graded.

The basic motivation for this homework is that, in the future, computer memory will be cheap enough for everyone to store caches of large portions of the Internet. Some people will prefer to search these caches on their local machines rather than go to a search engine company and do the search there. The advantage of doing a search on your own machine is that you are not telling a third party what you are searching for.

The slowest part of creating a viable search engine is doing the actual crawling of the Internet. To facilitate personal search, you could imagine a company which creates Internet caches for download. A compressed Internet cache of 100 million pages might be storable in 10-20GB, so with a decent connection it might be feasible to download such a cache once a month.

For the first part of this homework, I would like you to design a file searchsummaries.dtd that would be useful for storing web page summaries. Storing web-site summary information as an XML document has been considered before. For example, for DMOZ, the summary information can be obtained from: http://rdf.dmoz.org/. Wikipedia also has a similar service for downloading snapshots of the whole Wikipedia site (actual pages rather than summaries). Your language should be geared toward the summaries produced by the Hw4-Crawler-Ranker-Engine Project-ZIP. You should check that your DTD is valid using Oxygen. You should also create a couple of example documents which validate with respect to this DTD in Oxygen.
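
To give a rough feel for what is being asked, below is a minimal sketch of such a DTD together with a document that would validate against it. The element and attribute names (summaries, summary, url, title, description, rank) are assumptions made purely for illustration; your DTD should be shaped by the fields the Hw4-Crawler-Ranker-Engine Project actually produces.

A sketch of searchsummaries.dtd:

<!ELEMENT summaries (summary*)>
<!ELEMENT summary (url, title, description)>
<!ATTLIST summary rank CDATA #IMPLIED>
<!ELEMENT url (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT description (#PCDATA)>

An example document that validates against it:

<?xml version="1.0"?>
<!DOCTYPE summaries SYSTEM "searchsummaries.dtd">
<summaries>
  <summary rank="1">
    <url>http://www.example.com/</url>
    <title>Example Domain</title>
    <description>A placeholder page used in documentation.</description>
  </summary>
</summaries>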

The Hw4-Crawler-Ranker-Engine Project illustrates a quite common feature of many web sites: in addition to the scripts which are used to display web pages, one quite often has several auxiliary scripts, perhaps run as cron jobs, which calculate things or perform activities whose results are then used by the web-page scripts. In the case of this project there are three main PHP scripts: crawl.php, which implements a web crawler; rank.php, which implements a basic page rank calculator; and index.php, which is a web page used to implement a search engine. Neither crawl.php nor rank.php is intended to be run from the web; rather, they are intended to be run from the command line. At its simplest, you could create a database in mysql called mysearch, edit the config.php file, set some seed sites to crawl from in seedsites.php, and then run the following commands at the command line:

php crawl.php newcrawl
php rank.php
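
Before these commands will do anything useful, the mysearch database must exist in mysql and seedsites.php must list some start URLs. Purely as an illustration, a seedsites.php might have the following shape; the variable name $SEED_SITES is an assumption for this sketch, so match whatever name config.php and crawl.php actually use.

<?php
// seedsites.php (sketch) -- start URLs handed to the crawler.
// The variable name below is hypothetical; follow the project's own code.
$SEED_SITES = array(
    "http://www.sjsu.edu/",
    "http://en.wikipedia.org/wiki/Main_Page"
);
?>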

Running the crawl and rank commands above generates data in your mysearch database such that if you point your web browser at index.php under your document root, index.php will display a search page, and if you use this page, it will correctly display search results. Be aware that if you are on a machine such as a Mac and you are using Xampp, you may have multiple versions of PHP installed on your system. The version of PHP which is first in your path might not be the one that knows about the Xampp installation of Mysql, so you might have to edit your path for these scripts to work from the command line.
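
For example, on a Mac with Xampp, commands along these lines show which php the shell will use and put Xampp's copy first in the path for the current session; the install path shown is an assumption and may differ on your machine.

which php
export PATH=/Applications/XAMPP/xamppfiles/bin:$PATH
php crawl.php newcrawl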

For the second part of your homework, you will extend the crawl.php auxiliary script. As it currently stands, this script can only extract links and data summaries from HTML files. There are a series of functions towards the end of the script that do this: html_processor, html_dom, etc. Other kinds of XML documents on the web also have links and are suitable for summarization, RSS feeds being one example. I would like you to extend crawl.php to support crawling of RSS pages. To do this, you need to add rss to the allowed page types in the config.php file and then write analogs of html_processor and the other html_ functions for RSS. This will give you an opportunity to experiment with the PHP XML processing classes such as DOMDocument and DOMXPath.
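
To give a sense of what an RSS analog of html_processor might involve, here is a minimal sketch using DOMDocument and DOMXPath. The function name, return format, and field names are assumptions made for illustration; your version must mirror what html_processor actually returns so the summaries fit the CRAWL_ITEM table.

<?php
// Sketch of an RSS analog of html_processor. Parses an RSS 2.0 feed
// string and pulls out each item's link, title, and description.
// The return format here is illustrative only.
function rss_processor($page)
{
    $summaries = array();

    libxml_use_internal_errors(true); // don't spew warnings on imperfect feeds
    $dom = new DOMDocument();
    if ($dom->loadXML($page) === false) {
        return $summaries; // not parseable as XML
    }

    $xpath = new DOMXPath($dom);
    // In RSS 2.0 the items live under /rss/channel/item.
    $items = $xpath->query("/rss/channel/item");

    foreach ($items as $item) {
        $summary = array();
        foreach (array("link", "title", "description") as $tag) {
            $nodes = $item->getElementsByTagName($tag);
            $summary[$tag] = ($nodes->length > 0) ?
                trim($nodes->item(0)->textContent) : "";
        }
        $summaries[] = $summary;
    }

    return $summaries;
}
?>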

For the third part of the homework, I want you to experiment with Ajax. To do this, I want you to modify the index.php file so that when you hover over the link for a search result, you see a thumbnail with a picture of what the web site looks like. To see an example of what this might look like, go to Ask.com and do a search, then hover your pointer over one of the binoculars you see on the page of search results. Generating an image of a web page yourself is not too terrible, but rather than have you figure that out, you can use an existing service with a REST API like: http://webthumb.bluga.net/home.
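
One common way to wire this up is to have a hover handler in index.php make an Ajax request back to a small script on your own site, which in turn calls the thumbnail service's REST API and relays the image to the browser. The sketch below is such a proxy using PHP's cURL functions; the endpoint, query parameters, API-key handling, and returned content type for webthumb are assumptions, so check the service's documentation for the real request format.

<?php
// thumb_proxy.php (sketch) -- hypothetical Ajax endpoint called from index.php.
// Takes ?url=... for a search result, asks a thumbnail REST service for an
// image of that page, and echoes the image bytes back to the browser.

$target = isset($_GET['url']) ? $_GET['url'] : "";
if ($target == "") {
    header("HTTP/1.0 400 Bad Request");
    exit();
}

// Placeholder endpoint and key; the real webthumb API will differ.
$api_key = "YOUR_API_KEY";
$request = "http://webthumb.bluga.net/api?apikey=" . urlencode($api_key) .
    "&url=" . urlencode($target);

$ch = curl_init($request);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$image = curl_exec($ch);
curl_close($ch);

if ($image === false) {
    header("HTTP/1.0 502 Bad Gateway");
    exit();
}

header("Content-Type: image/jpeg"); // assumed image format
echo $image;
?>

On the index.php side, an onmouseover handler for each result link could then point an img element's src at something like thumb_proxy.php?url=... so the thumbnail is fetched through your own site.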

Crawler Derby

For this homework, there are several bonus points available. 2pts will be awarded if you can double the speed of the crawler over its baseline speed while it still checks things like robots.txt files correctly and extracts and stores summaries correctly. 4pts will be awarded if you can quadruple its speed. 6pts will be awarded to the fastest crawler that works correctly and goes at least four times the baseline speed. Timings will all be done on the same machine, running one instance of your program for 1 hour and comparing the results to running one instance of the original code for 1 hour. If you receive bonus points, they will be indicated after your HW score but will not be added to the sum of your homework score. Accumulated bonus points will be added to your total score for the semester after the grade curve for the semester has been calculated.

Point Breakdown

searchsummaries.dtd validates and could be used to store the kind of summaries stored in the Hw4-Crawler-Ranker-Engine Project (1pt each). 2pts
XML documents which validate according to searchsummaries.dtd (1/2pt each). 1pt
Revised crawler can process RSS documents, extract links, and create summaries which match the CRAWL_ITEM table format (1pt each). 3pts
index.php modified to display thumbnails (1pt), these thumbnails come from an Ajax request back to your site (1pt), and your site makes a request of a thumbnail generator site and gets the results back (1pt). 3pts
bug.txt has a list of bugs that were worked on for your project and these match the svn commit log. 1pt
Total: 10pts